Final Project: Yelp Reviews¶

Members: Nadine Alcantara, Andre Ruiz, Rexter Delos Santos, Kameron Wong¶

Introduction¶

When it comes to the world of businesses, having good reviews equals good economical flow in most cases, but just how influential can reviews really be? For our project, we are looking at the effects of positive and negative reviews on businesses using Yelp’s review dataset. With that in mind, we also want to show how successful a business can get based off of initial reviews.

Infrastructure¶

For this project, the main coding language will be Python. Since we are using the over 8 million reviews as a baseline to solve our problem, we will be using Spark so we can be flexible in the ways we download/upload the data from yelp to our iPynb.

Data Schema¶

In [6]:
business = ss.read.json("data/yelp_academic_dataset_business.json")
checkin = ss.read.json("data/yelp_academic_dataset_checkin.json")
review = ss.read.json("data/yelp_academic_dataset_review.json")
tip = ss.read.json("data/yelp_academic_dataset_tip.json")
user = ss.read.json("data/yelp_academic_dataset_user.json")
21/12/16 06:45:03 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                
In [6]:
business.printSchema()
print(business.count())
root
 |-- address: string (nullable = true)
 |-- attributes: struct (nullable = true)
 |    |-- AcceptsInsurance: string (nullable = true)
 |    |-- AgesAllowed: string (nullable = true)
 |    |-- Alcohol: string (nullable = true)
 |    |-- Ambience: string (nullable = true)
 |    |-- BYOB: string (nullable = true)
 |    |-- BYOBCorkage: string (nullable = true)
 |    |-- BestNights: string (nullable = true)
 |    |-- BikeParking: string (nullable = true)
 |    |-- BusinessAcceptsBitcoin: string (nullable = true)
 |    |-- BusinessAcceptsCreditCards: string (nullable = true)
 |    |-- BusinessParking: string (nullable = true)
 |    |-- ByAppointmentOnly: string (nullable = true)
 |    |-- Caters: string (nullable = true)
 |    |-- CoatCheck: string (nullable = true)
 |    |-- Corkage: string (nullable = true)
 |    |-- DietaryRestrictions: string (nullable = true)
 |    |-- DogsAllowed: string (nullable = true)
 |    |-- DriveThru: string (nullable = true)
 |    |-- GoodForDancing: string (nullable = true)
 |    |-- GoodForKids: string (nullable = true)
 |    |-- GoodForMeal: string (nullable = true)
 |    |-- HairSpecializesIn: string (nullable = true)
 |    |-- HappyHour: string (nullable = true)
 |    |-- HasTV: string (nullable = true)
 |    |-- Music: string (nullable = true)
 |    |-- NoiseLevel: string (nullable = true)
 |    |-- Open24Hours: string (nullable = true)
 |    |-- OutdoorSeating: string (nullable = true)
 |    |-- RestaurantsAttire: string (nullable = true)
 |    |-- RestaurantsCounterService: string (nullable = true)
 |    |-- RestaurantsDelivery: string (nullable = true)
 |    |-- RestaurantsGoodForGroups: string (nullable = true)
 |    |-- RestaurantsPriceRange2: string (nullable = true)
 |    |-- RestaurantsReservations: string (nullable = true)
 |    |-- RestaurantsTableService: string (nullable = true)
 |    |-- RestaurantsTakeOut: string (nullable = true)
 |    |-- Smoking: string (nullable = true)
 |    |-- WheelchairAccessible: string (nullable = true)
 |    |-- WiFi: string (nullable = true)
 |-- business_id: string (nullable = true)
 |-- categories: string (nullable = true)
 |-- city: string (nullable = true)
 |-- hours: struct (nullable = true)
 |    |-- Friday: string (nullable = true)
 |    |-- Monday: string (nullable = true)
 |    |-- Saturday: string (nullable = true)
 |    |-- Sunday: string (nullable = true)
 |    |-- Thursday: string (nullable = true)
 |    |-- Tuesday: string (nullable = true)
 |    |-- Wednesday: string (nullable = true)
 |-- is_open: long (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- name: string (nullable = true)
 |-- postal_code: string (nullable = true)
 |-- review_count: long (nullable = true)
 |-- stars: double (nullable = true)
 |-- state: string (nullable = true)

[Stage 5:=======>                                                   (1 + 7) / 8]
160585
                                                                                
In [7]:
checkin.printSchema()
print(checkin.count())
root
 |-- business_id: string (nullable = true)
 |-- date: string (nullable = true)

[Stage 7:====================================>                      (5 + 3) / 8]
138876
                                                                                
In [8]:
review.printSchema()
print(review.count())
root
 |-- business_id: string (nullable = true)
 |-- cool: long (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- review_id: string (nullable = true)
 |-- stars: double (nullable = true)
 |-- text: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)

[Stage 9:======================================================>  (50 + 2) / 52]
8635403
                                                                                

Data Visualizations¶

Total Number of Business Categories¶

In [13]:
categories = (business
             .select("business_id", "categories")
             .withColumn("categories", F.explode(F.split(F.col("categories"), ", "))))

categories.show()
+--------------------+--------------------+
|         business_id|          categories|
+--------------------+--------------------+
|6iYb2HFDywm3zjuRg...|          Gastropubs|
|6iYb2HFDywm3zjuRg...|                Food|
|6iYb2HFDywm3zjuRg...|        Beer Gardens|
|6iYb2HFDywm3zjuRg...|         Restaurants|
|6iYb2HFDywm3zjuRg...|                Bars|
|6iYb2HFDywm3zjuRg...|American (Traditi...|
|6iYb2HFDywm3zjuRg...|            Beer Bar|
|6iYb2HFDywm3zjuRg...|           Nightlife|
|6iYb2HFDywm3zjuRg...|           Breweries|
|tCbdrRPZA0oiIYSmH...|               Salad|
|tCbdrRPZA0oiIYSmH...|                Soup|
|tCbdrRPZA0oiIYSmH...|          Sandwiches|
|tCbdrRPZA0oiIYSmH...|               Delis|
|tCbdrRPZA0oiIYSmH...|         Restaurants|
|tCbdrRPZA0oiIYSmH...|               Cafes|
|tCbdrRPZA0oiIYSmH...|          Vegetarian|
|bvN78flM8NLprQ1a1...|            Antiques|
|bvN78flM8NLprQ1a1...|             Fashion|
|bvN78flM8NLprQ1a1...|                Used|
|bvN78flM8NLprQ1a1...|Vintage & Consign...|
+--------------------+--------------------+
only showing top 20 rows

In [14]:
categories_count = (categories
                   .groupBy("categories")
                   .count()
                   .sort("count", ascending=False))

categories_count.show()
[Stage 19:==================================================>   (186 + 9) / 200]
+--------------------+-----+
|          categories|count|
+--------------------+-----+
|         Restaurants|50763|
|                Food|29469|
|            Shopping|26205|
|       Beauty & Spas|16574|
|       Home Services|16465|
|    Health & Medical|15102|
|      Local Services|12192|
|           Nightlife|11990|
|                Bars|10741|
|          Automotive|10119|
|Event Planning & ...| 9644|
|         Active Life| 9231|
|        Coffee & Tea| 7725|
|          Sandwiches| 7272|
|             Fashion| 6599|
|American (Traditi...| 6541|
|         Hair Salons| 5900|
|               Pizza| 5756|
|     Hotels & Travel| 5703|
|  Breakfast & Brunch| 5505|
+--------------------+-----+
only showing top 20 rows

                                                                                
In [15]:
categories_df = categories_count.limit(20).toPandas()

categories_histogram = px.histogram(categories_df, y="categories", x="count")

categories_histogram.show()
                                                                                

Total Reviews Per Year¶

In [17]:
review_dates = (review
                .join(business, "business_id")
                .select("review_id", "business_id", "date")
                .sort("business_id", "date", ascending=False)
                .withColumn("month", F.split(F.col("date"), "-")[1])
                .withColumn("year", F.split(F.col("date"), "-")[0]))

review_dates.show()

review_count_by_date = (review_dates
                        .select("month", "year")
                        .sort("year", "month", ascending=False))

review_count_by_date.groupBy("month").pivot("year").count().sort("month", ascending=True).show()

review_by_year = (review_count_by_date
                 .groupBy("year").count()
                 .sort("year"))
review_by_year.show()
                                                                                
+--------------------+--------------------+-------------------+-----+----+
|           review_id|         business_id|               date|month|year|
+--------------------+--------------------+-------------------+-----+----+
|RIlyZDUUL7dn-wX9R...|zzzKmD9Mj6WtJwJUh...|2012-01-05 07:25:19|   01|2012|
|Beiu8FUpabvKp_tIj...|zzzKmD9Mj6WtJwJUh...|2011-07-06 03:11:19|   07|2011|
|NsbPfBB7VcKavo2kP...|zzzKmD9Mj6WtJwJUh...|2011-01-30 07:26:01|   01|2011|
|ZLg9JZBm8fti3kdcK...|zzzKmD9Mj6WtJwJUh...|2010-12-26 08:06:25|   12|2010|
|_PDg4GKwqstcU4jqS...|zzzKmD9Mj6WtJwJUh...|2010-05-05 21:14:56|   05|2010|
|7Me9k1V0KNyqAds43...|zzzKmD9Mj6WtJwJUh...|2010-03-07 20:52:10|   03|2010|
|PEgXvQE2ZKGQZWqt5...|zzzKmD9Mj6WtJwJUh...|2009-10-22 19:53:49|   10|2009|
|coFCcXcoVnTLwNMct...|zzzKmD9Mj6WtJwJUh...|2009-10-13 08:08:57|   10|2009|
|79BNQZEWEcZjZe4G7...|zzxIO4bjWCQrvCJ0O...|2020-06-05 12:17:10|   06|2020|
|DWbIic86a-H_8nau6...|zzxIO4bjWCQrvCJ0O...|2020-01-09 21:53:36|   01|2020|
|6LVhP-V7QtsjN34Ao...|zzxIO4bjWCQrvCJ0O...|2019-08-15 22:00:46|   08|2019|
|h5xb4XzKaFr_hBS1_...|zzxIO4bjWCQrvCJ0O...|2018-05-09 11:46:59|   05|2018|
|64mGlTN9U8uh3qLQq...|zzxIO4bjWCQrvCJ0O...|2011-11-02 14:02:07|   11|2011|
|K0DGtUbEh9mKHrLh4...|zzwK-TJsCJX5wZrdt...|2016-08-02 16:31:57|   08|2016|
|uQzssdiURqnvdQVPd...|zzwK-TJsCJX5wZrdt...|2016-07-24 19:06:03|   07|2016|
|xgfrMLg0mBFkbzZyN...|zzwK-TJsCJX5wZrdt...|2016-07-19 04:10:43|   07|2016|
|7aAloTeX7KceHKivy...|zzwK-TJsCJX5wZrdt...|2016-07-19 01:17:27|   07|2016|
|d6KrSTDkBcRmqBZMh...|zzwK-TJsCJX5wZrdt...|2016-07-18 19:54:03|   07|2016|
|tseKwcVWB-PBSLzbK...|zzwK-TJsCJX5wZrdt...|2015-05-06 14:45:58|   05|2015|
|nw1N6IVukv-Ez_JTL...|zzwK-TJsCJX5wZrdt...|2014-05-06 16:05:21|   05|2014|
+--------------------+--------------------+-------------------+-----+----+
only showing top 20 rows

                                                                                
+-----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+-----+-----+-----+
|month|2004|2005|2006|2007| 2008| 2009| 2010| 2011| 2012| 2013| 2014| 2015| 2016| 2017|  2018| 2019| 2020| 2021|
+-----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+-----+-----+-----+
|   01|null|   1| 863|4758|10197|17799|24147|35023|40899|44374|54145|70979|82320|80022| 85867|86871|76453|44461|
|   02|null|   4|1267|4466| 9863|15605|23299|33158|37994|41225|49677|63486|78146|75541| 80386|81578|74612| null|
|   03|null|  25|1497|4543|10755|17154|26245|37661|41296|44404|58557|76281|81878|87367| 94662|93600|50136| null|
|   04|null|  48|1873|4557|10733|15891|24116|35072|37682|42412|56016|72023|80990|87630| 93447|86074|24036| null|
|   05|null|  70|1754|6012|10903|16180|23697|35669|38974|44925|58678|78086|84232|91233| 94852|91534|33660| null|
|   06|null| 502|1593|6576|11580|17062|25276|36004|39974|44714|61867|78012|83452|91644| 97191|93595|43657| null|
|   07|null|2412|1546|7575|14964|19087|29174|40410|43052|49496|69241|85882|90116|99411|107107|97029|50353| null|
|   08|null|2271|2028|6668|17138|21619|32210|41051|42557|53526|71823|89623|86628|93197|100206|94767|54419| null|
|   09|null| 320|2101|6367|14139|18689|27601|35003|37909|48239|62774|76155|77204|83352| 86467|83076|51666| null|
|   10|  38| 290|2514|6662|13553|19210|28034|35750|39256|49242|63055|77803|77082|85783| 84966|82121|52815| null|
|   11|   2| 212|3308|6485|13646|18389|26970|33680|37203|46552|60272|70896|68240|77382| 77851|73277|44667| null|
|   12|  12| 284|3475|7247|12965|17112|26814|32711|35645|46631|60014|68303|70239|76995| 81333|74047|45417| null|
+-----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+-----+-----+-----+

[Stage 35:======================================================> (51 + 1) / 52]
+----+-------+
|year|  count|
+----+-------+
|2004|     52|
|2005|   6439|
|2006|  23819|
|2007|  71916|
|2008| 150436|
|2009| 213797|
|2010| 317583|
|2011| 431192|
|2012| 472441|
|2013| 555740|
|2014| 726119|
|2015| 907529|
|2016| 960527|
|2017|1029557|
|2018|1084335|
|2019|1037569|
|2020| 601891|
|2021|  44461|
+----+-------+

                                                                                
In [18]:
review_by_year_pd = review_by_year.toPandas()

fig = px.bar(review_by_year_pd, x="year", y="count", title="Total Reviews Per Year")
fig.show()
                                                                                

Total Check-Ins Per Year¶

In [19]:
checkin_date = (checkin
                .join(business, "business_id")
                .withColumn("checkin_count", F.size(F.split(F.col("date"), ",")))
                .select("business_id", "date", "checkin_count")
                .sort("checkin_count", ascending=False)
                .withColumn("month", F.split(F.col("date"), "-")[1])
                .withColumn("year", F.split(F.col("date"), "-")[0]))
checkin_date.show()
checkin_counts_by_year = (checkin_date
                         .groupBy("year")
                         .sum("checkin_count")
                         .withColumnRenamed("sum(checkin_count)", "count")
                         .sort("year"))
checkin_counts_by_year.show()
                                                                                
+--------------------+--------------------+-------------+-----+----+
|         business_id|                date|checkin_count|month|year|
+--------------------+--------------------+-------------+-----+----+
|7sxYa0-TwWeWGFr5C...|2010-01-17 02:00:...|       150005|   01|2010|
|EqUqaLJxMDxIqvnma...|2010-01-16 04:58:...|        75511|   01|2010|
|2PxZ-fICnd432NJHe...|2010-01-21 17:52:...|        63982|   01|2010|
|UlndVD4tezU3FACjZ...|2010-01-17 13:19:...|        54657|   01|2010|
|vITc1UkWLvG9pVdaO...|2010-01-18 23:59:...|        54361|   01|2010|
|cyBm7p3D2RAoGlDn-...|2010-01-30 23:31:...|        25208|   01|2010|
|NvAYQvjLxwGC-kcWz...|2010-01-16 17:37:...|        24017|   01|2010|
|4CxF8c3MB7VAdY8zF...|2010-01-17 23:35:...|        20291|   01|2010|
|OPfgKOm_n-ajUo3qj...|2010-01-18 05:12:...|        19941|   01|2010|
|m3DeBd0NSbAGDjUOk...|2010-01-19 18:16:...|        18388|   01|2010|
|BOFD5UmhHvCn_XH3_...|2010-02-07 02:27:...|        17033|   02|2010|
|C_k727ws82eMe9xtJ...|2010-01-18 21:50:...|        16848|   01|2010|
|OQ2oHkcWA8KNC1Lsv...|2010-02-13 17:19:...|        15946|   02|2010|
|qeuJgUdcmL5yAweOs...|2010-01-16 22:32:...|        15451|   01|2010|
|Tjgte2R4e-3JWj4aA...|2010-02-11 18:49:...|        15378|   02|2010|
|zioLxtBc9THNS2TOn...|2010-01-17 04:40:...|        14934|   01|2010|
|5y2zZGIE2a4tuEJ5F...|2010-01-16 02:57:...|        14153|   01|2010|
|5xS76cdT7THG14H_W...|2010-02-21 14:55:...|        14151|   02|2010|
|PrsvO1rzkgg6qFizl...|2010-01-18 18:40:...|        13360|   01|2010|
|o_L9Ss4boqq6ZEF9x...|2012-04-14 19:15:...|        12202|   04|2012|
+--------------------+--------------------+-------------+-----+----+
only showing top 20 rows

[Stage 46:====================================>                     (5 + 3) / 8]
+----+--------+
|year|   count|
+----+--------+
|2010|11717005|
|2011| 1634183|
|2012| 1255754|
|2013|  980808|
|2014|  902834|
|2015|  756902|
|2016|  585924|
|2017|  392881|
|2018|  259931|
|2019|  128699|
|2020|   26019|
|2021|     352|
+----+--------+

                                                                                
In [20]:
checkin_counts_by_year_pd = checkin_counts_by_year.toPandas()

fig = px.bar(checkin_counts_by_year_pd, x="year", y="count", title="Total Check-Ins Per Year")
fig.show()
                                                                                
In [21]:
user_reviews_location = (user
                           .join(review, 'user_id')
                           .join(business, 'business_id'))

user_reviews_location.select('user_id', 'business_id', 'latitude', 'longitude').show()
[Stage 56:======================================================> (51 + 1) / 52]
+--------------------+--------------------+-------------+--------------+
|             user_id|         business_id|     latitude|     longitude|
+--------------------+--------------------+-------------+--------------+
|--hJsDxzXZURcLxaL...|Ln-8CbKGZGmF-GCqM...|   33.7787398|    -84.409326|
|--hJsDxzXZURcLxaL...|QS_ojFa7FpfOVThrs...|    33.923006|   -84.3399804|
|-0Ji0nOyFe-4yo8BK...|1hJEqPHe65oJmO6sc...|   30.3479823|   -97.7353261|
|-0KAIqdO-DhrIXcY9...|WyIk1ZWy31Qaa8xUi...|   30.0135114|   -97.8635298|
|-0KAIqdO-DhrIXcY9...|GtlDQ8GG8ykG1ion3...|30.0127452517|-97.8907751147|
|-0KAIqdO-DhrIXcY9...|OUZIlIytYIiqPj2RF...|   30.0169045|   -97.8633635|
|-0KAIqdO-DhrIXcY9...|_Fs1js9sOEFDJ2Txg...|30.0116336043|-97.8900940009|
|-0KAIqdO-DhrIXcY9...|k_JMC4M3BJ4X9pC5X...|   30.0708447|    -97.873136|
|-0KAIqdO-DhrIXcY9...|9BBnf7fLTs2EIyDFo...|   30.0280744|   -97.8765501|
|-0KAIqdO-DhrIXcY9...|4-zgy_r5j82FiX2ex...|   30.2288371|    -97.788686|
|-0KAIqdO-DhrIXcY9...|4-zgy_r5j82FiX2ex...|   30.2288371|    -97.788686|
|-0KAIqdO-DhrIXcY9...|KY-EXj--9PlSBGhOS...|   30.0130756|   -97.8917627|
|-0KAIqdO-DhrIXcY9...|N2lTmVjYrQrw1nl9M...|   30.1769379|   -97.7904271|
|-0atCwlr6SSpYniMw...|juSETXWf57AtH5rh2...|   30.2619868|    -97.722894|
|-1KKYzibGPyUX-Mwk...|ROa5tRU4lUn1ffu0H...| 33.757101775|-84.3881696716|
|-1KKYzibGPyUX-Mwk...|SFqFFIA4Ks2oHfgEA...|    28.440381|   -81.4698224|
|-25JZ2VjrGZfXR8Ea...|S0QEoRla89uZyI1BC...|28.6069310793| -81.209671868|
|-25JZ2VjrGZfXR8Ea...|t5EgwRXG6MZSihdqu...|   28.8114547|   -81.2666383|
|-25JZ2VjrGZfXR8Ea...|_OvpwIXRqRXkVjCgu...|   30.3954441|   -97.7269882|
|-25JZ2VjrGZfXR8Ea...|x6SoT6Zg8mrIx1uAi...|   30.2928986|    -97.741776|
+--------------------+--------------------+-------------+--------------+
only showing top 20 rows

                                                                                

Distribution of Reviews Across the United States¶

In [11]:
review_counts = (business
                  .select("name", "latitude", "longitude", "review_count", "stars", "state")
                  .sort("stars", "review_count", ascending=False))
review_counts.show()
[Stage 15:=============================>                            (4 + 4) / 8]
+--------------------+-------------+--------------+------------+-----+-----+
|                name|     latitude|     longitude|review_count|stars|state|
+--------------------+-------------+--------------+------------+-----+-----+
|Powell's City of ...|   45.5230969|   -122.681325|        4295|  5.0|   OR|
|     Aviva by Kameel|33.7605381455|-84.3865554648|        1532|  5.0|   GA|
|        Stanley Park|   49.2978842|  -123.1308093|         939|  5.0|   BC|
|Ovation Coffee & Tea|    45.532588|  -122.6810282|         740|  5.0|   OR|
|    Pedal Bike Tours|    45.521572|  -122.6729487|         736|  5.0|   OR|
| Ciao! Pizza & Pasta|   42.3892206|   -71.0408823|         733|  5.0|   MA|
|   1618 Asian Fusion|   30.2453207|   -97.7303193|         641|  5.0|   TX|
|Lake Travis Zipli...|     30.42903|    -97.890062|         584|  5.0|   TX|
|Brakes To Go - Mo...|   30.2729209|   -97.7443863|         577|  5.0|   TX|
|      Jewboy Burgers|   30.3127408|   -97.7150455|         516|  5.0|   TX|
|      Swiss Hibiscus|   45.5587705|   -122.651453|         510|  5.0|   OR|
|Peace Frog Specia...|   30.2648272|   -97.7341597|         509|  5.0|   TX|
|                 Tōv|    45.512267|  -122.6323559|         503|  5.0|   OR|
|Einstein Moving C...|    30.348606|     -97.67326|         465|  5.0|   TX|
|Realty Austin - C...|   30.3154649|   -97.7346205|         441|  5.0|   TX|
|            Car Zone|   42.3748556|   -71.0937018|         426|  5.0|   MA|
|Auto Glass Solutions|    30.380665|      -97.7237|         411|  5.0|   TX|
|Always Available ...|30.2481061334|-97.7615793065|         411|  5.0|   TX|
|T-Loc's Sonora Ho...|   30.3211689|   -97.7395055|         410|  5.0|   TX|
|  Stairhopper Movers|   42.3796489|   -71.0717834|         410|  5.0|   MA|
+--------------------+-------------+--------------+------------+-----+-----+
only showing top 20 rows

                                                                                
In [22]:
review_counts_pd = review_counts.toPandas()
fig = px.density_mapbox(review_counts_pd, lat='latitude', lon='longitude', z='review_count', title="Review Density Heatmap", radius=10, zoom=2.75,
                        mapbox_style="stamen-terrain")
fig.show()
                                                                                

Distribution of Business Check-Ins Across the United States¶

In [7]:
business_checkin = (business
                    .join(checkin, "business_id")
                    .withColumn("checkin_count", F.size(F.split(F.col("date"), ",")))
                    .select("business_id", "latitude", "longitude", "checkin_count")
                    .sort("checkin_count", ascending=False))

business_checkin.show()
[Stage 6:>                                                          (0 + 4) / 4]
+--------------------+-------------+---------------+-------------+
|         business_id|     latitude|      longitude|checkin_count|
+--------------------+-------------+---------------+-------------+
|7sxYa0-TwWeWGFr5C...|33.6407227072| -84.4276356697|       150005|
|EqUqaLJxMDxIqvnma...|     45.58979|    -122.595204|        75511|
|2PxZ-fICnd432NJHe...|28.4278388741|  -81.308182904|        63982|
|UlndVD4tezU3FACjZ...|    30.202473|     -97.666861|        54657|
|vITc1UkWLvG9pVdaO...|42.3622379746| -71.0210756391|        54361|
|cyBm7p3D2RAoGlDn-...|    28.417663|     -81.581212|        25208|
|NvAYQvjLxwGC-kcWz...|    28.374694|     -81.549404|        24017|
|4CxF8c3MB7VAdY8zF...|45.5226120114|-122.6730836114|        20291|
|OPfgKOm_n-ajUo3qj...|   45.5230969|    -122.681325|        19941|
|m3DeBd0NSbAGDjUOk...|49.1942850587|  -123.18076195|        18388|
|BOFD5UmhHvCn_XH3_...|    28.370971|     -81.519392|        17033|
|C_k727ws82eMe9xtJ...|   39.9980574|    -82.8838088|        16848|
|OQ2oHkcWA8KNC1Lsv...|   45.5230858|     -122.64168|        15946|
|qeuJgUdcmL5yAweOs...|   45.5246501|   -122.6818687|        15451|
|Tjgte2R4e-3JWj4aA...|   28.3575294|    -81.5582714|        15378|
|zioLxtBc9THNS2TOn...|    45.504723|    -122.632105|        14934|
|5y2zZGIE2a4tuEJ5F...|    30.270557|     -97.753177|        14153|
|5xS76cdT7THG14H_W...|    28.359719|     -81.591313|        14151|
|PrsvO1rzkgg6qFizl...|42.3641813865| -71.0542184385|        13360|
|o_L9Ss4boqq6ZEF9x...|   45.5289444|   -122.6982801|        12202|
+--------------------+-------------+---------------+-------------+
only showing top 20 rows

                                                                                
In [8]:
business_checkin_pd = business_checkin.toPandas()

fig = px.density_mapbox(business_checkin_pd, lat='latitude', lon='longitude', z='checkin_count', title="Check-In Density Heatmap", radius=10, zoom=2.75,
                        mapbox_style="stamen-terrain")
fig.show()
                                                                                

Machine Learning¶

Using: Linear Regression¶

In [25]:
# group reviews and checkins
# get count for each business grouped by months

review_checkin = (review
                  .join(checkin, ["business_id", "date"], "full")
                  .withColumn("year", F.split(F.col("date"), "-")[0])
                  .withColumn("combined_count", F.size(F.split(F.col("date"), ",")))
                  .select("business_id", "year", "combined_count")
                  .groupBy(["business_id", "year"])
                  .sum("combined_count"))

ratings = (review
            .withColumn("year", F.split(F.col("date"), "-")[0])
            .select("business_id", "year", "stars")
            .groupBy(["business_id", "year"])
            .mean("stars"))

ratings_reviews = (review_checkin
                   .join(ratings, ["business_id", "year"], "full"))

                  

ratings_reviews.sort(F.desc("sum(combined_count)")).show()
[Stage 69:=====================================================>(197 + 3) / 200]
+--------------------+----+-------------------+------------------+
|         business_id|year|sum(combined_count)|        avg(stars)|
+--------------------+----+-------------------+------------------+
|7sxYa0-TwWeWGFr5C...|2010|             150157|3.0592105263157894|
|EqUqaLJxMDxIqvnma...|2010|              75582| 4.450704225352113|
|2PxZ-fICnd432NJHe...|2010|              64063|3.2839506172839505|
|UlndVD4tezU3FACjZ...|2010|              54720| 4.111111111111111|
|vITc1UkWLvG9pVdaO...|2010|              54437| 3.486842105263158|
|cyBm7p3D2RAoGlDn-...|2010|              25267| 4.220338983050848|
|NvAYQvjLxwGC-kcWz...|2010|              24095| 4.256410256410256|
|4CxF8c3MB7VAdY8zF...|2010|              20795|3.7876984126984126|
|OPfgKOm_n-ajUo3qj...|2010|              20176| 4.753191489361702|
|m3DeBd0NSbAGDjUOk...|2010|              18432| 4.136363636363637|
|BOFD5UmhHvCn_XH3_...|2010|              17060|3.7777777777777777|
|C_k727ws82eMe9xtJ...|2010|              16875|3.4444444444444446|
|OQ2oHkcWA8KNC1Lsv...|2010|              16146|             4.495|
|qeuJgUdcmL5yAweOs...|2010|              15571|3.7916666666666665|
|Tjgte2R4e-3JWj4aA...|2010|              15411|3.9393939393939394|
|zioLxtBc9THNS2TOn...|2010|              15202| 3.955223880597015|
|5y2zZGIE2a4tuEJ5F...|2010|              14265| 4.223214285714286|
|5xS76cdT7THG14H_W...|2010|              14201|              4.16|
|PrsvO1rzkgg6qFizl...|2010|              13687|4.0672782874617734|
|o_L9Ss4boqq6ZEF9x...|2012|              12381|4.4636871508379885|
+--------------------+----+-------------------+------------------+
only showing top 20 rows

                                                                                
In [26]:
windowval = (Window.partitionBy('business_id').orderBy('date')
             .rangeBetween(Window.unboundedPreceding, 0))
review_count = (review
                .withColumn('rev_count', F.count('stars').over(windowval))
                .withColumn("month", F.split(F.col("date"), "-")[1])
                .withColumn("year", F.split(F.col("date"), "-")[0])
                .select('business_id', 'month', 'year', 'stars', 'rev_count')
                .groupBy(["business_id", "month", "year"])
                .sum("rev_count"))

review_count.show()
[Stage 70:======================================================> (51 + 1) / 52]
+--------------------+-----+----+--------------+
|         business_id|month|year|sum(rev_count)|
+--------------------+-----+----+--------------+
|-36nnCT71XE0InJXK...|   07|2014|             1|
|-36nnCT71XE0InJXK...|   02|2015|             2|
|-36nnCT71XE0InJXK...|   07|2015|             3|
|-36nnCT71XE0InJXK...|   11|2015|             4|
|-36nnCT71XE0InJXK...|   03|2016|             5|
|-36nnCT71XE0InJXK...|   09|2016|             6|
|-36nnCT71XE0InJXK...|   10|2016|            15|
|-36nnCT71XE0InJXK...|   12|2016|             9|
|-36nnCT71XE0InJXK...|   02|2017|            33|
|-36nnCT71XE0InJXK...|   04|2017|            13|
|-36nnCT71XE0InJXK...|   05|2017|            14|
|-36nnCT71XE0InJXK...|   06|2017|            31|
|-36nnCT71XE0InJXK...|   12|2017|            17|
|-36nnCT71XE0InJXK...|   01|2018|            37|
|-36nnCT71XE0InJXK...|   06|2018|            41|
|-36nnCT71XE0InJXK...|   07|2018|            69|
|-36nnCT71XE0InJXK...|   08|2018|            25|
|-36nnCT71XE0InJXK...|   09|2018|            26|
|-36nnCT71XE0InJXK...|   03|2019|            27|
|-36nnCT71XE0InJXK...|   05|2019|            28|
+--------------------+-----+----+--------------+
only showing top 20 rows

                                                                                
In [27]:
windowval = (Window.partitionBy('business_id').orderBy('date')
             .rangeBetween(Window.unboundedPreceding, 0))
review_sum_stars = (review
                .withColumn('running_sum', F.sum('stars').over(windowval))
                .withColumn("month", F.split(F.col("date"), "-")[1])
                .withColumn("year", F.split(F.col("date"), "-")[0])
                .select('business_id', 'month', 'year', 'running_sum')
                .groupBy(["business_id", "month", "year"])
                .sum("running_sum"))

review_sum_stars.show()
[Stage 72:======================================================> (51 + 1) / 52]
+--------------------+-----+----+----------------+
|         business_id|month|year|sum(running_sum)|
+--------------------+-----+----+----------------+
|-36nnCT71XE0InJXK...|   07|2014|             1.0|
|-36nnCT71XE0InJXK...|   02|2015|             2.0|
|-36nnCT71XE0InJXK...|   07|2015|             7.0|
|-36nnCT71XE0InJXK...|   11|2015|             8.0|
|-36nnCT71XE0InJXK...|   03|2016|             9.0|
|-36nnCT71XE0InJXK...|   09|2016|            14.0|
|-36nnCT71XE0InJXK...|   10|2016|            35.0|
|-36nnCT71XE0InJXK...|   12|2016|            21.0|
|-36nnCT71XE0InJXK...|   02|2017|            69.0|
|-36nnCT71XE0InJXK...|   04|2017|            29.0|
|-36nnCT71XE0InJXK...|   05|2017|            31.0|
|-36nnCT71XE0InJXK...|   06|2017|            65.0|
|-36nnCT71XE0InJXK...|   12|2017|            34.0|
|-36nnCT71XE0InJXK...|   01|2018|            75.0|
|-36nnCT71XE0InJXK...|   06|2018|            86.0|
|-36nnCT71XE0InJXK...|   07|2018|           142.0|
|-36nnCT71XE0InJXK...|   08|2018|            50.0|
|-36nnCT71XE0InJXK...|   09|2018|            51.0|
|-36nnCT71XE0InJXK...|   03|2019|            52.0|
|-36nnCT71XE0InJXK...|   05|2019|            53.0|
+--------------------+-----+----+----------------+
only showing top 20 rows

                                                                                
In [28]:
review_count_sum = (review_count
                    .join(review_sum_stars, ['business_id','month', 'year'], 'full')
                    .select('business_id', 'month', 'year', 'sum(rev_count)', 'sum(running_sum)')
                    )
review_count_sum.show()
[Stage 76:====================================================> (193 + 7) / 200]
+--------------------+-----+----+--------------+----------------+
|         business_id|month|year|sum(rev_count)|sum(running_sum)|
+--------------------+-----+----+--------------+----------------+
|--2mEJ63SC_8_08_j...|   11|2016|             6|            14.0|
|--UNNdnHRhsyFUbDg...|   08|2016|          7132|         31341.0|
|--xmN4i0Hoqx0pPtr...|   11|2017|            44|           180.0|
|-0F6zJ4parKu5WMoh...|   08|2016|             6|            26.0|
|-0U6xz5cKY1PB_TJh...|   05|2015|           120|           401.0|
|-0XME_3EJtOIyiTaY...|   11|2012|             6|            24.0|
|-0iqnv7MjKrgh7Q7b...|   04|2017|           365|          1381.0|
|-11j5tUL810_k3Soo...|   11|2019|            21|            68.0|
|-1b5PAJFq9L-EXEzP...|   02|2014|           234|           543.0|
|-1lP2wOmxWidM6Ge4...|   09|2020|            18|            58.0|
|-2CqsOEUICBAkdyu7...|   05|2018|             1|             5.0|
|-2MT2xutGx_6EeZHA...|   08|2013|            19|            95.0|
|-2S-_MGNtX1FcRaTO...|   11|2016|             8|            26.0|
|-2aMFgYft7rED0bst...|   08|2017|             3|            15.0|
|-2joeHbqY9TayADes...|   01|2016|           195|           784.0|
|-2joeHbqY9TayADes...|   05|2019|           302|          1232.0|
|-2joeHbqY9TayADes...|   07|2018|           279|          1141.0|
|-384bwLp3FoesKs7c...|   02|2018|            90|           447.0|
|-3H3oRyCbLuGHVVdq...|   06|2014|             2|            10.0|
|-3VQI-QXI7fr4TWH0...|   10|2016|           757|          2783.0|
+--------------------+-----+----+--------------+----------------+
only showing top 20 rows

                                                                                
In [29]:
review_checkin = (review
                  .join(checkin, ["business_id", "date"], "full")
                  .withColumn("month", F.split(F.col("date"), "-")[1])
                  .withColumn("year", F.split(F.col("date"), "-")[0])
                  .withColumn("combined_count", F.size(F.split(F.col("date"), ",")))
                  .select("business_id", "month", "year", "combined_count")
                  .groupBy(["business_id", "month", "year"])
                  .sum("combined_count"))

review_checkin.sort("year", "month").show()
                                                                                
+--------------------+-----+----+-------------------+
|         business_id|month|year|sum(combined_count)|
+--------------------+-----+----+-------------------+
|F6Qsav77WqGhstQ1E...|   10|2004|                  1|
|uSM_sCp22BJWeatTD...|   10|2004|                  1|
|rdS7hBBeukiX4Led9...|   10|2004|                  1|
|RThn3_Y6qN8MsqZKW...|   10|2004|                  1|
|dr9PgB_kTp998OYWi...|   10|2004|                  2|
|_VkLbBvqBMmDJ9EUO...|   10|2004|                  1|
|Hnx34LxtfW4XV0gZ1...|   10|2004|                  1|
|Aigcg-qqohtSmjDci...|   10|2004|                  2|
|sf04_4kcxi6UkXMUl...|   10|2004|                  1|
|_irBz9yudNSlDUVAe...|   10|2004|                  1|
|GPhnm_rcjkbc_b96D...|   10|2004|                  1|
|VvhGHzjyOdStXblX8...|   10|2004|                  1|
|CGbuorlhKS88aKwhb...|   10|2004|                  1|
|0Ov-g-4eTOTSgVDBf...|   10|2004|                  1|
|pvlM--HZY1a8SqMXi...|   10|2004|                  1|
|54ElwAyN-o8e4uvOk...|   10|2004|                  1|
|w7JNKVadLeZoqfpTG...|   10|2004|                  1|
|bJLeoKF04fHYNoxpa...|   10|2004|                  1|
|j6PJEamZViym5thvz...|   10|2004|                  1|
|JdVlUaXM9aZQ68tjg...|   10|2004|                  1|
+--------------------+-----+----+-------------------+
only showing top 20 rows

In [30]:
enough_data = (review_checkin
               .groupBy("business_id")
               .count())
enough_data.show()
[Stage 85:=====================================================>(198 + 2) / 200]
+--------------------+-----+
|         business_id|count|
+--------------------+-----+
|dW4pD2rGTIhwCLCtP...|   85|
|6qlWc-OVhTQfcyVi3...|   29|
|j2NOygXjaTufrsKFl...|   58|
|88kosdu9NbLZnHrio...|   85|
|D2bmsnOkBIt72WYxO...|   28|
|RhGnmMMEaNm3JwKKD...|   56|
|euJD4G-BgXYY5BgFl...|   76|
|kh9qwwqUcGwgexuiU...|   33|
|FbZLY5XASP9phBySt...|   39|
|q_Hw7TE8IXJtaQICG...|    8|
|6KGBXOeSJYf9ePdyA...|  161|
|8Lnid7N3bcAWY--CI...|   28|
|PBfFmFy8C0WOtpHF_...|  148|
|u7s0q_yGEihxnsS0m...|   27|
|MhYtCVMzARhHpApK4...|  123|
|8A_OhuJIr6k0Czdz4...|   36|
|CeeO1uNYB0uvTNC6M...|   31|
|3KqpiLDAjeeMmZeU-...|   52|
|W60O4ast9uAq03n7n...|  153|
|r7-YSA7Z8p4KISZ-j...|  132|
+--------------------+-----+
only showing top 20 rows

                                                                                
In [31]:
enough_data = (enough_data
               .where(enough_data['count'] >= 24))
enough_data.show()
[Stage 90:===================================================>  (192 + 8) / 200]
+--------------------+-----+
|         business_id|count|
+--------------------+-----+
|dW4pD2rGTIhwCLCtP...|   85|
|6qlWc-OVhTQfcyVi3...|   29|
|j2NOygXjaTufrsKFl...|   58|
|88kosdu9NbLZnHrio...|   85|
|D2bmsnOkBIt72WYxO...|   28|
|RhGnmMMEaNm3JwKKD...|   56|
|euJD4G-BgXYY5BgFl...|   76|
|kh9qwwqUcGwgexuiU...|   33|
|FbZLY5XASP9phBySt...|   39|
|6KGBXOeSJYf9ePdyA...|  161|
|8Lnid7N3bcAWY--CI...|   28|
|PBfFmFy8C0WOtpHF_...|  148|
|u7s0q_yGEihxnsS0m...|   27|
|MhYtCVMzARhHpApK4...|  123|
|8A_OhuJIr6k0Czdz4...|   36|
|CeeO1uNYB0uvTNC6M...|   31|
|3KqpiLDAjeeMmZeU-...|   52|
|W60O4ast9uAq03n7n...|  153|
|r7-YSA7Z8p4KISZ-j...|  132|
|Ukr4_xiBTUPOJ8ySP...|   38|
+--------------------+-----+
only showing top 20 rows

                                                                                
In [32]:
final_df = (review_count_sum
            .join(review_checkin, ['business_id', 'month', 'year'], 'inner')
            .join(enough_data, 'business_id', 'inner'))
final_df.show()
[Stage 101:===================================================> (194 + 6) / 200]
+--------------------+-----+----+--------------+----------------+-------------------+-----+
|         business_id|month|year|sum(rev_count)|sum(running_sum)|sum(combined_count)|count|
+--------------------+-----+----+--------------+----------------+-------------------+-----+
|-36nnCT71XE0InJXK...|   05|2019|            28|            53.0|                  1|   25|
|-36nnCT71XE0InJXK...|   12|2017|            17|            34.0|                  1|   25|
|-36nnCT71XE0InJXK...|   09|2016|             6|            14.0|                  1|   25|
|-36nnCT71XE0InJXK...|   03|2016|             5|             9.0|                  1|   25|
|-36nnCT71XE0InJXK...|   07|2014|             1|             1.0|                  1|   25|
|-36nnCT71XE0InJXK...|   06|2017|            31|            65.0|                  2|   25|
|-36nnCT71XE0InJXK...|   03|2020|            30|            59.0|                  1|   25|
|-36nnCT71XE0InJXK...|   09|2019|            29|            58.0|                  1|   25|
|-36nnCT71XE0InJXK...|   01|2018|            37|            75.0|                  2|   25|
|-36nnCT71XE0InJXK...|   02|2017|            33|            69.0|                  3|   25|
|-36nnCT71XE0InJXK...|   07|2020|            65|           125.0|                  2|   25|
|-36nnCT71XE0InJXK...|   05|2017|            14|            31.0|                  1|   25|
|-36nnCT71XE0InJXK...|   11|2015|             4|             8.0|                  1|   25|
|-36nnCT71XE0InJXK...|   10|2016|            15|            35.0|                  2|   25|
|-36nnCT71XE0InJXK...|   08|2018|            25|            50.0|                  1|   25|
|-36nnCT71XE0InJXK...|   02|2015|             2|             2.0|                  1|   25|
|-36nnCT71XE0InJXK...|   06|2018|            41|            86.0|                  2|   25|
|-36nnCT71XE0InJXK...|   09|2018|            26|            51.0|                  1|   25|
|-36nnCT71XE0InJXK...|   03|2019|            27|            52.0|                  1|   25|
|-36nnCT71XE0InJXK...|   07|2015|             3|             7.0|                  1|   25|
+--------------------+-----+----+--------------+----------------+-------------------+-----+
only showing top 20 rows

                                                                                
In [33]:
x1 = final_df.select('sum(rev_count)').rdd.flatMap(lambda x: x).collect()
x2 = final_df.select('sum(running_sum)').rdd.flatMap(lambda x: x).collect()
y = final_df.select('sum(combined_count)').rdd.flatMap(lambda x: x).collect()
                                                                                
In [34]:
x1 = np.array(x1).reshape(-1,1)
x2 = np.array(x2).reshape(-1,1)
In [35]:
xx = [[i[0],j[0]] for i, j in zip(x1,x2)]
reg = LinearRegression().fit(xx, y)

# print(f"linear regression score: {reg.score(x, y)}")
print(f"linear regression score: {reg.score(xx, y)}")
print(f"coefficient: {reg.coef_}")
print(f"y-intercept: {reg.intercept_}")
linear regression score: 0.00013265360198433385
coefficient: [-5.83676050e-05  9.08684594e-05]
y-intercept: 6.196444054201265

Conclusion¶

Based on reviewing the data and relating the different tables together, we found 3 different attempts at correlating reviews/checkins to business. All three attempts lead to a very low coefficient of determination or "score" which tells us that reviews and ratings have no impact on future business.